Planning Stage: Data Description and Exploratory Data Analysis¶

Renata Lovette, Group 41 – Airbnb Prices in London, Rome & Budapest¶

Data Description¶

Data Source: Determinants of Airbnb prices in European cities: A spatial econometrics approach (Supplementary Material)

The dataset includes different determinants of Airbnb prices across different European cities (London, Rome and Budapest) during the weekdays and weekends. For each city, the dataset includes 19 variables that were determined via spatial econometric analysis methods.

Variable Name Data Type Description
realSum numerical price in Euros for a 2 night stay for 2 people
room_type categorical type of room (ie. shared, private)
room_shared boolean is the room shared?
room_private boolean is the room private?
person_capacity numerical maximum guest capacity in the room
host_is_superhost boolean does the host have superhost status?
multi binary does property have multiple listings?
biz binary is property hosted for business purposes?
cleanliness_rating numerical cleanliness rating of the listing
guest_satisfaction_overall numerical overall guest satisfaction rating
bedrooms numerical number of bedrooms
dist numerical distance from the city centre in km
metro_dist numerical distance from the nearest metro station in km
attr_index numerical attraction index of airbnb location
attr_index_norm numerical normalised attraction index of airbnb location (0-100)
rest_index numerical restaurant index of airbnb location
rest_index_norm numerical normalised restaurant index of airbnb location (0-100)
lng numerical longitude of airbnb location
lat numerical latitude of airbnb location

The number of observations for Airbnbs in...

  • London are 4613 (weekdays) and 5378 (weekends)
  • Rome are 4491 (weekdays) and 4534 (weekends)
  • Budapest are 2073 (weekdays) and 1947 (weekends)

Question of Interest¶

We want to examine the association between the price of Airbnb listings (response) and predictors related to the location of the Airbnb (e.g., distance from the city centre, distance from the nearest metro station, attraction index), which European city (London, Budapest, or Rome) and when during the week (e.g., weekday or weekend).

Response Variable: realSum

Exploratory Variables (not limited to): dist, metro_dist, attr_index, attr_index_norm, rest_index, rest_index_norm

Primary Focus: Prediction – we want to be able to predict how much an Airbnb listing should be priced to help Airbnb hosts price their listings competitively. Although our main focus will be prediction, inference will help us develop a better understanding of how price is affected by the different variables and could tell us about what Airbnb customers may look for in an accomodation.

Exploratory Data Analysis¶

Data Wrangling and Cleaning¶

In [1]:
# load libraries
library(tidyverse)
library(dplyr)
library(ggplot2)
library(httr)
library(readr)
library(car)
library(glmnet)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: carData


Attaching package: ‘car’


The following object is masked from ‘package:dplyr’:

    recode


The following object is masked from ‘package:purrr’:

    some


Loading required package: Matrix


Attaching package: ‘Matrix’


The following objects are masked from ‘package:tidyr’:

    expand, pack, unpack


Loaded glmnet 4.1-8

In [2]:
# read datasets using URL
url <- "https://www.kaggle.com/api/v1/datasets/download/thedevastator/airbnb-prices-in-european-cities"
file <- "airbnb_data.zip"


GET(url, write_disk(file, overwrite = TRUE))
unzip(file, exdir = "airbnb_data")

# read data for individual cities
london_weekday <- read_csv("airbnb_data/london_weekdays.csv")
london_weekend <- read_csv("airbnb_data/london_weekends.csv")

rome_weekday <- read_csv("airbnb_data/rome_weekdays.csv")
rome_weekend <- read_csv("airbnb_data/rome_weekends.csv")

budapest_weekday <- read_csv("airbnb_data/budapest_weekdays.csv")
budapest_weekend <- read_csv("airbnb_data/budapest_weekends.csv")
Response [https://storage.googleapis.com:443/kaggle-data-sets/2919695/7809961/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20250328%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20250328T012043Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=b3c4212db609209300b638da94daf575c9dde2d2117aacfd03b2a91e0312bd652658e7508ad53db7b08fab074f987935992509e1663b6d074e4cf5cc53ab9af161b984ab7df51e2cfcacbd09c8f5904901a4bb12267f43493e2ad6d29709df7ef6558a3e9404f5fd46fd5c30c285e09e9a300d5c22bee3dd316871dfe36cabe905776fd7f66ce78d4b1eea1add38bbf07afa01c974b8b75d7d56ab8fb73fe7419c27ea3d1b1833303df81017003f8c0003c64f836dcc68a4076cb65e0b752b2fa374d60a069fc4efb370dfe2ab05f93fba750a2d29d48faed7cad75e645854458dd1562ae89f9a61823b144881727d514408523d7af2b1089214014e8d9ffc6f]
  Date: 2025-03-28 01:20
  Status: 200
  Content-Type: application/zip
  Size: 4.1 MB
<ON DISK>  /home/jovyan/work/stat-301/project/airbnb_data.zipNULL
New names:
• `` -> `...1`
Rows: 4614 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 5379 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 4492 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 4535 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 2074 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 1948 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In [3]:
# add city variable and weekday/weekend variable

# London Airbnbs
london_weekday['city'] <- 'London'
london_weekday['weekend_weekday'] <- 'Weekday'

london_weekend['city'] <- 'London'
london_weekend['weekend_weekday'] <- 'Weekend'

# Rome Airbnbs
rome_weekday['city'] <- 'Rome'
rome_weekday['weekend_weekday'] <- 'Weekday'

rome_weekend['city'] <- 'Rome'
rome_weekend['weekend_weekday'] <- 'Weekend'

# Budapest Airbnbs
budapest_weekday['city'] <- 'Budapest'
budapest_weekday['weekend_weekday'] <- 'Weekday'

budapest_weekend['city'] <- 'Budapest'
budapest_weekend['weekend_weekday'] <- 'Weekend'
In [4]:
# combine into 1 dataframe
london_df <- merge(london_weekday, london_weekend, all.x = TRUE, all.y = TRUE)
rome_df <- merge(rome_weekday, rome_weekend, all.x = TRUE, all.y = TRUE)
budapest_df <- merge(budapest_weekday, budapest_weekend, all.x = TRUE, all.y = TRUE)

airbnb_df <- merge(london_df, rome_df, all.x = TRUE, all.y = TRUE) %>% merge(budapest_df, all.x = TRUE, all.y = TRUE)

head(airbnb_df)
A data.frame: 6 × 22
...1realSumroom_typeroom_sharedroom_privateperson_capacityhost_is_superhostmultibizcleanliness_rating⋯distmetro_distattr_indexattr_index_normrest_indexrest_index_normlnglatcityweekend_weekday
<dbl><dbl><chr><lgl><lgl><dbl><lgl><dbl><dbl><dbl>⋯<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr>
10121.1223Private room FALSE TRUE2FALSE00 6⋯5.73411670.4370940222.882215.493414 470.0885 8.413765-0.0497551.52570London Weekend
20156.8747Private room FALSE TRUE2 TRUE1010⋯2.97846781.5957331281.1639 6.230648 697.727215.19148612.4865441.92498Rome Weekday
30172.7725Private room FALSE TRUE2FALSE0010⋯1.22258240.3977605550.078412.1872321075.412123.43062112.5018141.88987Rome Weekend
40238.9905Entire home/aptFALSEFALSE6 TRUE0110⋯0.35935500.3526430404.404724.116552 893.477367.65685319.0507447.50076BudapestWeekday
50332.0487Entire home/aptFALSEFALSE6 TRUE0110⋯0.35937230.3526618404.398524.136091 893.418278.10079019.0507447.50076BudapestWeekend
60570.0981Entire home/aptFALSEFALSE2FALSE0010⋯5.30101781.5889904209.632614.571793 467.5975 8.372724-0.1603251.46531London Weekday
In [5]:
# check for missing values

missing_val_count <- sum(is.na(airbnb_df))
missing_val_count
0

Pre-selection of variables¶

I noticed that there is a column titled ...1 that acts as an identity/unique identifier column. So, I am going to remove it because the values have become irrelevant now that the different datasets been merged into one.

There are also multiple columns in the dataset that measure the same variable: room_type, room_shared, room_private. For this reason I choose to remove the room_shared and room_private columns to simplify and remove repeated data the same information is conveyed in the room_type variable.

In [6]:
# drop 'room_shared' and 'room_private' columns
airbnb_df <- select(airbnb_df, -(c('...1', room_shared, room_private)))
head(airbnb_df)
A data.frame: 6 × 19
realSumroom_typeperson_capacityhost_is_superhostmultibizcleanliness_ratingguest_satisfaction_overallbedroomsdistmetro_distattr_indexattr_index_normrest_indexrest_index_normlnglatcityweekend_weekday
<dbl><chr><dbl><lgl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr>
1121.1223Private room 2FALSE00 66915.73411670.4370940222.882215.493414 470.0885 8.413765-0.0497551.52570London Weekend
2156.8747Private room 2 TRUE10109512.97846781.5957331281.1639 6.230648 697.727215.19148612.4865441.92498Rome Weekday
3172.7725Private room 2FALSE00109311.22258240.3977605550.078412.1872321075.412123.43062112.5018141.88987Rome Weekend
4238.9905Entire home/apt6 TRUE01109910.35935500.3526430404.404724.116552 893.477367.65685319.0507447.50076BudapestWeekday
5332.0487Entire home/apt6 TRUE01109910.35937230.3526618404.398524.136091 893.418278.10079019.0507447.50076BudapestWeekend
6570.0981Entire home/apt2FALSE00109815.30101781.5889904209.632614.571793 467.5975 8.372724-0.1603251.46531London Weekday
In [7]:
# check observations in categorical levels

# roomtype per city
rm_type_per_city_count <- airbnb_df %>%
    group_by(city) %>%
    count(room_type)

rm_type_per_city_count
A grouped_df: 9 × 3
cityroom_typen
<chr><chr><int>
BudapestEntire home/apt3589
BudapestPrivate room 419
BudapestShared room 14
London Entire home/apt4384
London Private room 5559
London Shared room 50
Rome Entire home/apt5561
Rome Private room 3454
Rome Shared room 12

From the table above, the level of room_type = "Shared room" is low incomparison to the other values. If we kept this data in the dataset, there is a high chance that we would encounter problems when splitting the data into training and testing set. There is a chance that all the "Shared room" data points might fall into the testing set. Our model wouldn't recognize these points because it wouldn't have been trained on data points where room_type = "Shared room". Thus, I will drop this categorical level and remove the data points accordingly.

In [8]:
# remove observations that include room_type = "Shared room"

clean_airbnb_df <- airbnb_df %>% filter(room_type != "Shared room")

rm_type_per_city_count_2 <- clean_airbnb_df %>%
    group_by(city) %>%
    count(room_type)

rm_type_per_city_count_2
A grouped_df: 6 × 3
cityroom_typen
<chr><chr><int>
BudapestEntire home/apt3589
BudapestPrivate room 419
London Entire home/apt4384
London Private room 5559
Rome Entire home/apt5561
Rome Private room 3454

Because our question of interest uses realSum as the response variable, let's look at the range of the values of this variable.

In [9]:
# check min and max prices of airbnbs
max_price <- max(clean_airbnb_df$realSum)
min_price <- min(clean_airbnb_df$realSum)

cat("highest price: ", max_price, "lowest price: ", min_price)
highest price:  15499.89 lowest price:  34.77934
In [10]:
#check average prices of airbnbs per city
avg_prices <- clean_airbnb_df %>%
    group_by(city, weekend_weekday) %>%
    summarise("avg_price" = mean(realSum))

avg_prices
`summarise()` has grouped output by 'city'. You can override using the
`.groups` argument.
A grouped_df: 6 × 3
cityweekend_weekdayavg_price
<chr><chr><dbl>
BudapestWeekday168.5781
BudapestWeekend185.3225
London Weekday361.1261
London Weekend365.3278
Rome Weekday201.7924
Rome Weekend209.2436
In [11]:
head(clean_airbnb_df)
A data.frame: 6 × 19
realSumroom_typeperson_capacityhost_is_superhostmultibizcleanliness_ratingguest_satisfaction_overallbedroomsdistmetro_distattr_indexattr_index_normrest_indexrest_index_normlnglatcityweekend_weekday
<dbl><chr><dbl><lgl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr>
1121.1223Private room 2FALSE00 66915.73411670.4370940222.882215.493414 470.0885 8.413765-0.0497551.52570London Weekend
2156.8747Private room 2 TRUE10109512.97846781.5957331281.1639 6.230648 697.727215.19148612.4865441.92498Rome Weekday
3172.7725Private room 2FALSE00109311.22258240.3977605550.078412.1872321075.412123.43062112.5018141.88987Rome Weekend
4238.9905Entire home/apt6 TRUE01109910.35935500.3526430404.404724.116552 893.477367.65685319.0507447.50076BudapestWeekday
5332.0487Entire home/apt6 TRUE01109910.35937230.3526618404.398524.136091 893.418278.10079019.0507447.50076BudapestWeekend
6570.0981Entire home/apt2FALSE00109815.30101781.5889904209.632614.571793 467.5975 8.372724-0.1603251.46531London Weekday

Visualization¶

Now that the data has been cleaned and wrangled, let's visualise the datapoints using a scatterplot. Here I am interested in the following variables: realSum, metro_dist, city, weekend_weekday. I want to know how the price is affected by these variables. Using colour and shape, I differentiated the cities and whether it is a weekend or weekday listing.

In [12]:
# plot a scatterplot

options(repr.plot.width = 15, repr.plot.height = 10)  #

scatter_price_dist <- clean_airbnb_df %>%
    ggplot(aes(x = metro_dist, y = realSum, colour = city, shape = weekend_weekday)) +
    geom_point(alpha = 0.3) +
    coord_cartesian(xlim = c(0, 9), ylim = c(0, 4000)) +
    facet_wrap(~weekend_weekday) +
    geom_hline(data = avg_prices, aes(yintercept = avg_price, colour = city)) + 
    ggtitle("Visualization 1: Price of Airbnbs vs Distance from Nearest Metro Station") + 
    xlab("Distance from Nearest Metro Station (km)") +
    ylab("Price of Airbnb Listing (Euros)")

scatter_price_dist
No description has been provided for this image

The graph above shows a concentrated area of datapoints between realSum = 0 and realSum = 4000 with a distance of less than or equal to 9km from the nearest metro station. The horizontal lines indicate the mean prices for each city on weekends and on weekdays.

Analysing the visualization¶

Looking at the graph, we see that the most expensive Airbnbs can be found in London, as shown by the green points in graph scatter_price_dist. Overall, there is a negative correlation between prices of Airbnb listings and the distance from the nearest metro station. So, based on the graphs, we can say that generally, the further away a listing is from a metro station, the lower the price of the listing.

Weekend vs Weekday – Using facet_wrap(~weekend) we can compare the scatterpoints of airbnb listings during the weekend and during weekdays. We see that both plots follow very the same patterns. From the horizontal lines, we can see that the prices for airbnbs during the weekend are slightly higher than on weekdays but the difference is minimal.

Cities – At first glance, the London listings seem to be the most prominent compared to Rome and Budapest. This is expected as the dataset contained more London Airbnbs than Airbnbs in the other 2 cities. The average prices between Rome and Budapest are about the same, with Rome having a slightly greater average.

Main Takeaway from Visualization – This visualization shows that there is a negative relationship between price of Airbnbs and distance to metro stations. Airbnb listings are priced higher when they are at a closer to a metro station in these European cities. Additionally, whether an Airbnb is a weekend or weekday listing seems to have a lesser impact on price as the points seem evenly dispersed throughout the plot. So, in relating to our question of interest, this plot is important because it shows us a possible key determinant of Airbnb prices.

Methods and Plan & Computational Code and Output¶

Method of interest: Multi-covariate Linear Regression with Interaction

I will use multi-covariate linear regression in my model to predict the price of an Airbnb listing in a given city (ie. Budapest, London, or Rome) because the price is affected by many different variables including metrics related to location, capacity and ratings. A linear model works best because the response variable realSum is a continuous numerical variable. Since, I am interested in predicting prices based on the cities, city will be a key predictor in my model. The model will include interaction because variables are associated with one another. For example, the value of person_capacity is related to the number of bedrooms and the guest_satisfaction_rating would have some sort of correlation with the cleanliness_rating.

Assumptions:

  1. A linear relation between the response variable realSum and the predictors is present. From Visualization 1 above there seems to be a nonlinear relationship between realSum and metro_dist. Introducing interaction terms between covariates by implementing an interaction model and log transformations will help to address this.
  2. Errors are independent because we assume that we have a random sample of Airbnb listings.
  3. There is a normal conditional distribution of error terms, but because we have a fairly large sample size of 22,966, normality of errors may not be necessary to have valid inference results.
  4. Error terms should have equal variance for the p-value and confidence intervals to hold validity.

Potential limitations and weaknesses: With multiple linear regression, I need to be aware of multicollinearity which can cause inflation of standard errors and coefficients less interpretable. During the EDA stage, I've removed room_shared and room_private because they measured the same attribute. There may be other variables that will contribute to multicollinearity, so I will use VIF to check for highly correlated predictors.

Additionally, my model assumes linear relationships between predictors and realSum, but exploratory analysis suggests that some variables (e.g., metro_dist) may exhibit nonlinear trends. I will consider transformations or interaction terms to better capture these effects.

Moreover, Airbnb prices are influenced by other factors not included in the original dataset such as available amenities and time of year. Exclusion of these other variables may result in bias, especially if these missing variables are correlated with other predictors present.

Computational Code and Output¶

In [13]:
# splitting data into train and test sets
airbnb_train <- clean_airbnb_df %>%
    slice_sample(prop = 0.8)

airbnb_test <- clean_airbnb_df %>%
    anti_join(airbnb_train)
Joining with `by = join_by(realSum, room_type, person_capacity,
host_is_superhost, multi, biz, cleanliness_rating, guest_satisfaction_overall,
bedrooms, dist, metro_dist, attr_index, attr_index_norm, rest_index,
rest_index_norm, lng, lat, city, weekend_weekday)`
In [14]:
# check split size
cat("Training Set Size:", nrow(airbnb_train), "\n")
cat("Test Set Size:", nrow(airbnb_test), "\n")
Training Set Size: 18372 
Test Set Size: 4594 
In [15]:
# fit model with training data
MLR_model <- lm(formula = realSum ~ city * ., data = airbnb_train)
summary(MLR_model)
Call:
lm(formula = realSum ~ city * ., data = airbnb_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-1048.9   -57.9   -18.2    28.2 14104.3 

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            1.216e+04  1.761e+04   0.690  0.48998
cityLondon                            -1.443e+03  1.878e+04  -0.077  0.93874
cityRome                              -3.016e+04  2.005e+04  -1.504  0.13249
room_typePrivate room                 -5.019e+01  1.864e+01  -2.693  0.00709
person_capacity                        1.206e+01  5.223e+00   2.310  0.02091
host_is_superhostTRUE                 -1.110e+01  1.136e+01  -0.977  0.32861
multi                                 -2.135e+01  1.313e+01  -1.625  0.10409
biz                                   -3.847e+00  1.297e+01  -0.296  0.76686
cleanliness_rating                     8.405e+00  8.543e+00   0.984  0.32525
guest_satisfaction_overall            -8.259e-01  1.116e+00  -0.740  0.45930
bedrooms                               2.699e+01  9.472e+00   2.849  0.00439
dist                                   1.306e+00  8.194e+00   0.159  0.87335
metro_dist                            -5.723e+00  1.286e+01  -0.445  0.65631
attr_index                            -2.949e+01  1.633e+02  -0.181  0.85665
attr_index_norm                        4.964e+02  2.736e+03   0.181  0.85604
rest_index                            -2.474e-01  4.418e-01  -0.560  0.57559
rest_index_norm                        3.000e+00  5.384e+00   0.557  0.57737
lng                                    3.026e+01  2.949e+02   0.103  0.91826
lat                                   -2.661e+02  3.238e+02  -0.822  0.41131
weekend_weekdayWeekend                -1.891e+00  2.380e+01  -0.079  0.93667
cityLondon:room_typePrivate room      -1.497e+02  2.040e+01  -7.340 2.22e-13
cityRome:room_typePrivate room         1.147e+01  2.086e+01   0.550  0.58263
cityLondon:person_capacity             1.945e+01  6.540e+00   2.975  0.00293
cityRome:person_capacity               1.260e+00  6.545e+00   0.193  0.84728
cityLondon:host_is_superhostTRUE       3.279e+01  1.477e+01   2.220  0.02641
cityRome:host_is_superhostTRUE         1.165e+01  1.386e+01   0.841  0.40053
cityLondon:multi                       9.528e+00  1.568e+01   0.608  0.54351
cityRome:multi                         2.279e+01  1.590e+01   1.433  0.15176
cityLondon:biz                        -8.405e+00  1.559e+01  -0.539  0.58984
cityRome:biz                           7.306e+00  1.625e+01   0.450  0.65297
cityLondon:cleanliness_rating         -9.302e+00  9.566e+00  -0.972  0.33087
cityRome:cleanliness_rating           -8.488e+00  1.051e+01  -0.808  0.41937
cityLondon:guest_satisfaction_overall  2.313e+00  1.201e+00   1.925  0.05424
cityRome:guest_satisfaction_overall    1.526e+00  1.289e+00   1.184  0.23642
cityLondon:bedrooms                    1.530e+02  1.187e+01  12.895  < 2e-16
cityRome:bedrooms                      1.656e+01  1.234e+01   1.342  0.17952
cityLondon:dist                        1.015e+01  8.640e+00   1.174  0.24023
cityRome:dist                         -1.924e+00  8.782e+00  -0.219  0.82662
cityLondon:metro_dist                 -1.379e+01  1.356e+01  -1.017  0.30934
cityRome:metro_dist                    6.402e+00  1.435e+01   0.446  0.65556
cityLondon:attr_index                  2.328e+03  2.206e+03   1.055  0.29133
cityRome:attr_index                   -6.154e+01  2.439e+02  -0.252  0.80078
cityLondon:attr_index_norm            -3.355e+04  3.176e+04  -1.056  0.29091
cityRome:attr_index_norm               3.615e+03  8.622e+03   0.419  0.67499
cityLondon:rest_index                  1.399e+01  9.185e+01   0.152  0.87893
cityRome:rest_index                   -2.320e+01  2.788e+01  -0.832  0.40525
cityLondon:rest_index_norm            -7.708e+02  5.131e+03  -0.150  0.88059
cityRome:rest_index_norm               1.075e+03  1.280e+03   0.840  0.40085
cityLondon:lng                        -2.316e+02  3.019e+02  -0.767  0.44297
cityRome:lng                          -2.533e+01  3.333e+02  -0.076  0.93941
cityLondon:lat                         5.367e+01  3.476e+02   0.154  0.87729
cityRome:lat                           6.938e+02  3.865e+02   1.795  0.07264
cityLondon:weekend_weekdayWeekend      2.519e+01  2.722e+01   0.926  0.35464
cityRome:weekend_weekdayWeekend       -1.226e-01  2.797e+01  -0.004  0.99650
                                         
(Intercept)                              
cityLondon                               
cityRome                                 
room_typePrivate room                 ** 
person_capacity                       *  
host_is_superhostTRUE                    
multi                                    
biz                                      
cleanliness_rating                       
guest_satisfaction_overall               
bedrooms                              ** 
dist                                     
metro_dist                               
attr_index                               
attr_index_norm                          
rest_index                               
rest_index_norm                          
lng                                      
lat                                      
weekend_weekdayWeekend                   
cityLondon:room_typePrivate room      ***
cityRome:room_typePrivate room           
cityLondon:person_capacity            ** 
cityRome:person_capacity                 
cityLondon:host_is_superhostTRUE      *  
cityRome:host_is_superhostTRUE           
cityLondon:multi                         
cityRome:multi                           
cityLondon:biz                           
cityRome:biz                             
cityLondon:cleanliness_rating            
cityRome:cleanliness_rating              
cityLondon:guest_satisfaction_overall .  
cityRome:guest_satisfaction_overall      
cityLondon:bedrooms                   ***
cityRome:bedrooms                        
cityLondon:dist                          
cityRome:dist                            
cityLondon:metro_dist                    
cityRome:metro_dist                      
cityLondon:attr_index                    
cityRome:attr_index                      
cityLondon:attr_index_norm               
cityRome:attr_index_norm                 
cityLondon:rest_index                    
cityRome:rest_index                      
cityLondon:rest_index_norm               
cityRome:rest_index_norm                 
cityLondon:lng                           
cityRome:lng                             
cityLondon:lat                           
cityRome:lat                          .  
cityLondon:weekend_weekdayWeekend        
cityRome:weekend_weekdayWeekend          
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 293 on 18318 degrees of freedom
Multiple R-squared:  0.2701,	Adjusted R-squared:  0.268 
F-statistic: 127.9 on 53 and 18318 DF,  p-value: < 2.2e-16
In [16]:
# check for multicolinearity using VIF test
vif(MLR_model) %>% round(4)
there are higher-order terms (interactions) in this model
consider setting type = 'predictor'; see ?vif

A matrix: 35 × 3 of type dbl
GVIFDfGVIF^(1/(2*Df))
city6.168284e+132 2802.4708
room_type1.803590e+011 4.2469
person_capacity9.950300e+001 3.1544
host_is_superhost5.314400e+001 2.3053
multi8.117600e+001 2.8491
biz8.270600e+001 2.8759
cleanliness_rating1.536770e+011 3.9202
guest_satisfaction_overall2.410430e+011 4.9096
bedrooms6.623100e+001 2.5735
dist9.665080e+011 9.8311
metro_dist3.528090e+011 5.9398
attr_index3.452707e+08118581.4623
attr_index_norm1.744563e+08113208.1914
rest_index1.296269e+041 113.8538
rest_index_norm1.456337e+031 38.1620
lng1.073485e+061 1036.0914
lat4.277705e+051 654.0417
weekend_weekday3.027740e+011 5.5025
city:room_type6.200650e+012 2.8061
city:person_capacity3.129641e+022 4.2060
city:host_is_superhost8.036500e+002 1.6837
city:multi1.980030e+012 2.1094
city:biz2.492990e+012 2.2345
city:cleanliness_rating5.334962e+042 15.1979
city:guest_satisfaction_overall8.847390e+042 17.2466
city:bedrooms1.003659e+022 3.1652
city:dist1.447825e+032 6.1685
city:metro_dist1.323315e+022 3.3917
city:attr_index3.256864e+19275543.9884
city:attr_index_norm2.005986e+19266924.0101
city:rest_index1.822657e+16211619.2001
city:rest_index_norm1.822298e+16211618.6288
city:lng1.584825e+072 63.0951
city:lat4.450921e+132 2582.9290
city:weekend_weekday3.121691e+022 4.2034
In [17]:
# use ridge to reduce high multicollinearity

airbnb_X_train <- model.matrix(object = realSum ~ city * ., data = airbnb_train)[, -1]
airbnb_Y_train <- airbnb_train$realSum

airbnb_X_test <- model.matrix(object = realSum ~ city * ., data = airbnb_test)[, -1]
airbnb_Y_test <- airbnb_test$realSum


cv_ridge_model <- cv.glmnet(x = airbnb_X_train, y = airbnb_Y_train, alpha = 0)
cv_ridge_model
Call:  cv.glmnet(x = airbnb_X_train, y = airbnb_Y_train, alpha = 0) 

Measure: Mean-Squared Error 

    Lambda Index Measure    SE Nonzero
min     13   100   86476 21492      53
1se   7697    31  107602 23499      53
In [18]:
best_lambda <- cv_ridge_model$lambda.min
best_lambda
12.5441752691265
In [19]:
ridge_final <- glmnet(x = airbnb_X_train, y = airbnb_Y_train, alpha = 0, lambda = best_lambda)
ridge_coefs <- coef(ridge_final, s = best_lambda)
ridge_coefs
54 x 1 sparse Matrix of class "dgCMatrix"
                                                 s1
(Intercept)                           -7.275602e+01
cityLondon                            -1.736851e+01
cityRome                              -2.922555e+00
room_typePrivate room                 -6.089186e+01
person_capacity                        1.233826e+01
host_is_superhostTRUE                 -8.885998e+00
multi                                 -1.253954e+01
biz                                    2.731605e+00
cleanliness_rating                     1.560157e+00
guest_satisfaction_overall             9.417247e-01
bedrooms                               4.624269e+01
dist                                   3.910691e+00
metro_dist                            -5.100026e+00
attr_index                             3.392743e-02
attr_index_norm                        2.698976e+00
rest_index                             7.640806e-03
rest_index_norm                       -8.701638e-02
lng                                    8.866412e-01
lat                                   -1.404539e-01
weekend_weekdayWeekend                 1.459653e+01
cityLondon:room_typePrivate room      -1.311924e+02
cityRome:room_typePrivate room         2.025431e+01
cityLondon:person_capacity             2.131494e+01
cityRome:person_capacity              -2.232379e-01
cityLondon:host_is_superhostTRUE       3.067100e+01
cityRome:host_is_superhostTRUE         6.143731e+00
cityLondon:multi                      -4.238407e-01
cityRome:multi                         1.362125e+01
cityLondon:biz                        -1.710014e+01
cityRome:biz                           1.471694e+00
cityLondon:cleanliness_rating          6.437983e-02
cityRome:cleanliness_rating           -7.912097e-01
cityLondon:guest_satisfaction_overall  4.341831e-02
cityRome:guest_satisfaction_overall   -1.228294e-02
cityLondon:bedrooms                    1.181335e+02
cityRome:bedrooms                     -5.483241e-01
cityLondon:dist                        1.446334e+00
cityRome:dist                         -4.818921e+00
cityLondon:metro_dist                 -8.726242e+00
cityRome:metro_dist                    5.530701e+00
cityLondon:attr_index                  1.755025e-01
cityRome:attr_index                   -3.602553e-03
cityLondon:attr_index_norm             2.459794e+00
cityRome:attr_index_norm              -9.783322e-02
cityLondon:rest_index                  8.779068e-03
cityRome:rest_index                    1.244899e-02
cityLondon:rest_index_norm             4.681783e-01
cityRome:rest_index_norm               5.747371e-01
cityLondon:lng                        -1.706205e+02
cityRome:lng                           1.147184e+00
cityLondon:lat                        -5.336878e-01
cityRome:lat                           1.792371e-01
cityLondon:weekend_weekdayWeekend     -1.475659e+01
cityRome:weekend_weekdayWeekend       -6.801898e+00
In [20]:
airbnb_full_OLS <- lm(realSum ~ city * ., data = airbnb_train)

# compare full model and ridge model
airbnb_coefs <- cbind(
  Full_OLS = as.vector(coef(airbnb_full_OLS)),
  Ridge_min = as.vector(ridge_coefs)) %>%
      round(4) %>% as.data.frame()

#airbnb_coefs
In [21]:
airbnb_test_pred_full_OLS <- predict(airbnb_full_OLS, newdata = airbnb_test)
head(airbnb_test_pred_full_OLS)
1
156.164828357232
2
140.254735874284
3
151.113978200912
4
156.237766532104
5
392.887166767934
6
179.53498815915
In [22]:
# compute RMSE of the full predictive model
airbnb_test_RMSEs <- tibble(
  Model = "OLS Full Regression",
  RMSE = mltools::rmse(
    preds = airbnb_test_pred_full_OLS,
    actuals = airbnb_test$realSum
  )
)
airbnb_test_RMSEs
A tibble: 1 × 2
ModelRMSE
<chr><dbl>
OLS Full Regression245.388
In [23]:
# compute RMSE of ridge (with best lambda) model
airbnb_test_pred_ridge <- predict(ridge_final, newx = airbnb_X_test)

airbnb_test_RMSEs <- rbind(
  airbnb_test_RMSEs,
  tibble(
    Model = "Ridge Regression with minimum MSE",
    RMSE = mltools::rmse(
       pred = airbnb_test_pred_ridge,
       actuals = airbnb_Y_test)
  )
)
airbnb_test_RMSEs
A tibble: 2 × 2
ModelRMSE
<chr><dbl>
OLS Full Regression 245.3880
Ridge Regression with minimum MSE244.9649

Analysis¶

I attempted Ridge Regression because I had really high multicollinearity and thought that using a Ridge model would help minimize this. However after further analysis, the difference between the RMSE values of a full linear regression model and the Ridge regression model is minimal. After Ridge, multicollinearity still occurs and because Ridge doesn't select and remove any variables, other models such as Lasso or stepwise selection models might be better options.

In [ ]: